A metagenomic investigation of phytoplasma diversity in Australian vegetable growing regions

Abstract
In this study, metagenomic sequence data was used to investigate the phytoplasma taxonomic diversity in vegetable-growing regions across Australia. Metagenomic sequencing was performed on 195 phytoplasma-positive samples, originating either from historic collections (n=46) or during collection efforts between January 2015 and June 2022 (n=149). The sampled hosts were classified as crop (n=155), weed (n=24), ornamental (n=7), native plant (n=6), and insect (n=3) species. Most samples came from Queensland (n=78), followed by Western Australia (n=46), the Northern Territory (n=32), New South Wales (n=29), and Victoria (n=10). Of the 195 draft phytoplasma genomes, 178 met our genome criteria for comparison using an average nucleotide identity approach. Ten distinct phytoplasma species were identified and could be classified within the 16SrII, 16SrXII (PCR only), 16SrXXV, and 16SrXXXVIII phytoplasma groups, which have all previously been recorded in Australia. The most commonly detected phytoplasma taxa in this study were species and subspecies classified within the 16SrII group (n=153), followed by strains within the 16SrXXXVIII group (‘Ca. Phytoplasma stylosanthis’; n=6). Several geographic- and host-range expansions were reported, as well as mixed phytoplasma infections of 16SrII taxa and ‘Ca. Phytoplasma stylosanthis’. Additionally, six previously unrecorded 16SrII taxa were identified, including five putative subspecies of ‘Ca. Phytoplasma australasiaticum’ and a new putative 16SrII species. PCR and sequencing of the 16S rRNA gene was a suitable triage tool for preliminary phytoplasma detection. Metagenomic sequencing, however, allowed for higher-resolution identification of the phytoplasmas, including mixed infections, than was afforded by only direct Sanger sequencing of the 16S rRNA gene. Since the metagenomic approach theoretically obtains sequences of all organisms in a sample, this approach was useful to confirm the host family, genus, and/or species. In addition to improving our understanding of the phytoplasma species that affect crop production in Australia, the study also significantly expands the genomic sequence data available in public sequence repositories to contribute to phytoplasma molecular epidemiology studies, revision of taxonomy, and improved diagnostics.



INTRODUCTION
Phytoplasmas are a diverse monophyletic clade of unculturable bacteria in the class Mollicutes within the provisional genus 'Candidatus Phytoplasma' [1]. They are phloem-limited plant pathogenic bacteria that are transmitted by phloem-feeding hemipteran insects [2]. Diseases associated with phytoplasma infections have been described in over 700 plant hosts globally from agriculturally and horticulturally important crops, ornamental plants, weeds, and native plants [3–5].
In Australia, phytoplasma-like symptoms were first recorded in the 1900s from tomato (Tomato Big Bud, TBB) [6], lucerne (Lucerne Witches'-Broom) [7], and pasture legumes (Legume Little Leaf) [8]. All were thought to be vector-transmitted viruses until mycoplasma-like structures were identified by electron microscopy in fabaceous plants showing symptoms of little leaves and spindled stems [9]. By the 1990s, molecular detections of phytoplasmas were being made globally [10], including in Australia [11], by PCR amplification of the 16S rRNA gene region. The molecular, PCR-based techniques facilitated the screening of many more plants for phytoplasma infections than was previously possible by microscopy and serological techniques. These PCR-based techniques involved restriction fragment length polymorphism (RFLP) analysis and/or sequencing for phytoplasma characterization and identification. Phytoplasma surveys were done in Australia using these molecular-based approaches and a large diversity of phytoplasma taxa affecting grains, legumes, fruit and vegetable crops, ornamentals, native plants, weeds and putative phytoplasma vectors was uncovered [12–18]. To date, twelve phytoplasma 16Sr groups and five 'Candidatus Phytoplasma' representatives have been described from Australia [19–25]. Members within the 16SrII group are the most commonly detected phytoplasmas in Australia, affecting a large number of host species [12,13,16,26]. However, members within the 16SrXII group have been described as the most economically important based on their association with yield-reducing diseases in high-value crops such as grapevines and strawberries [26–28]. Competent vector species remain to be confirmed for many of the phytoplasma taxa present in Australia, although Orosius argentatus has been shown to transmit diseases associated with 16SrII phytoplasmas [8,9,29,30].
Phytoplasma diversity and taxonomic analyses, including studies in Australia, have largely relied on the 16S rRNA gene sequence, which only offers low-resolution analyses of inter- and intraspecies diversity [1,12,31,32]. Diversity assessments and species delimitation studies have also involved higher-resolution analyses of three to nine additional housekeeping genes for multilocus sequence typing (MLST) and analysis (MLSA) [24,33,34]. With the decreasing cost of high-throughput sequencing (HTS) and increasing sequence data outputs, as well as the advancements in bioinformatic tools for metagenomic data assessments, phytoplasma genomes have increasingly been used to understand their taxonomy [31] and biology [35]. Obtaining draft or complete phytoplasma genomes allows for higher-resolution analyses than those based on one or a few genes for diversity analyses [36]. Additionally, by applying genome-based species delimitation thresholds and criteria specified for culturable bacteria, taxonomic boundaries between phytoplasma strains can be identified [36,37].
The aim of this study was to assess and update the species and genetic diversity of phytoplasmas in vegetable growing regions in the various states and territories of Australia using whole-genome-based approaches. To this end, (i) plants displaying phytoplasma-associated symptoms and insects were collected from vegetable growing regions in Australia and screened by PCR for the presence of phytoplasma infections, (ii) DNA from key historical phytoplasma strains from previous phytoplasma surveys in Australia was obtained, and (iii) a metagenomic approach was used to obtain draft phytoplasma genome assemblies to be used for genome-based investigations into the phytoplasma taxa infecting the samples, and to identify the host when the plant or insect host identity was inconclusive based on morphology. The results of this study demonstrate the applications, benefits, and challenges of applying metagenomic sequencing to phytoplasma diversity analyses.

Impact Statement
Phytoplasmas are unculturable, plant pathogenic bacteria that infect and impact yield of many agriculturally important plant species. In this study, 16S rRNA gene detection and sequencing for triaging was coupled with metagenomic sequencing to determine the diversity of phytoplasma taxa and associated diseases in Australian vegetable growing regions. It is the first study to use metagenomic analysis to improve the understanding of phytoplasma diversity in Australia. None of the phytoplasma taxa that were detected were exotic to Australia, but host- and geographic-range expansions were recorded for some. Since the metagenomic approach obtains DNA sequences from all organisms in a sample, the identification of plant and insect host families, genera, or species was possible by DNA barcode analysis when they were undetermined based on morphology. This study has analysed the largest number of phytoplasma whole genomes to date (n=195) and significantly contributes to the available sequence data for these bacteria. The sequence and metadata provided in this study offer an improved understanding of the phytoplasma taxa present in the different states and territories of Australia, and contribute to improving phytoplasma taxonomy, molecular epidemiology, and diagnostics.

Sample collection, total nucleic acid extraction, plant tissue preservation, and DNA quantification
Plant samples with phytoplasma-associated symptoms, including little leaf, yellowing, phyllody, and/or stunting that were collected Australia-wide between March 2019 and June 2022 were sent to the laboratory in Melbourne, Victoria (VIC) for analysis (Table S1). Total nucleic acid was extracted from samples using an iodixanol-based phytoplasma enrichment procedure [38] or a modified CTAB-DNeasy protocol without RNase treatment [39] (Table S1 and S2). Petioles, whole leaves or leaf veins were used in total nucleic acid extractions for most samples. Phloem scrapings were sampled for woody material (e.g. Melaleuca spp. and Vitaceae spp.). When possible, a subsample of the plant material was freeze dried for at least 72 h at −50 °C in individually labelled screw cap tubes using the FreeZone 2.5 Liter Benchtop Freeze Dry System (Labconco, MO, USA) and deposited in the Victorian plant pathology herbarium (VPRI) (Table S1). Insect samples were supplied as DNA extracts (Table S1) and had been collected by suction trapping and sweep netting in Jennings, New South Wales (NSW) and Palmerston, Northern Territory (NT), respectively. Insect collections were done in these areas as plants displaying phytoplasma-associated symptoms were present nearby. Phytoplasma-positive samples collected prior to 2019 that are held at The Northern Territory Department of Industry, Tourism and Trade (NT DITT), Darwin, NT, Australia and Department of Agriculture and Fisheries, Mareeba, Queensland (QLD), Australia were also supplied as total nucleic acid extracts (Table S1).
DNA quantity was estimated using a Qubit 2.0 fluorometer (Thermo Fisher Scientific, MA, USA) with the Qubit 1X dsDNA HS Assay Kit (Thermo Fisher Scientific, MA, USA). All DNA samples were stored at −20 °C.

Phytoplasma screening and preliminary identification by universal phytoplasma 16Sr PCR, and Sanger sequencing
Screening for PCR inhibitors was done using PCR primers for the generic amplification of the bacterial 16S rRNA gene [40]. A nested PCR assay using the P1/P7 and R16F2n/m23sr primer pairs was used to screen recent samples for phytoplasma infection and to confirm phytoplasma presence in total nucleic acid extracts from samples collected prior to 2019 [41]. These primers all bind to the phytoplasma 16S rRNA gene, apart from the P7 primer, which binds to the 5′ region of the 23S rRNA gene. The R16F2n/m23sr amplicons of some samples were cloned and screened according to [24] when poor Sanger sequencing quality was observed in the forward and/or reverse read (Table S1). All PCR amplicons were visualized by electrophoresis through 1 % agarose gels stained with SYBR Safe DNA gel stain (Thermo Fisher Scientific, MA, USA). PCR amplicons of the expected size were purified and directly Sanger sequenced (Macrogen, Seoul, South Korea). The identities of the Sanger sequenced PCR amplicons were determined by blastn analysis [42] at the NCBI (https://www.ncbi.nlm.nih.gov/, last accessed 30 July 2022). During the blastn analyses, the top hit with the 16S rRNA gene of a 'Ca. Phytoplasma' species reference strain was used to determine the identity of the sample investigated based on the top bit score, percent identity, and e-value, as well as considering the query coverage.

Library preparation and sequencing
Libraries were prepared with fragment sizes between 300 and 500 bp by following the manufacturer's protocols of the NEXTFLEX Rapid XP DNA-Seq Kit (PerkinElmer, MA, USA) with the Unique Dual Index (UDI) barcodes (PerkinElmer, MA, USA) or the Nextera DNA Flex Library Preparation Kit (Illumina, CA, USA) with the IDT for Illumina Nextera DNA Unique Dual (UD) Indexes (Illumina, CA, USA) (Table S1). The libraries were pooled and size-selected prior to sequencing according to [38] and sequenced with Illumina platforms including the MiSeq (2×250 bp), HiSeq2000 (2×150 bp), NovaSeq 6000 on an SP flow cell (2×250 bp), or NovaSeq 6000 on an S1 flow cell (2×150 bp). All the library pools sequenced on the NovaSeq 6000 platform were treated with the Illumina Free Adapter Blocking Reagent (Illumina, CA, USA) according to the manufacturer's protocols prior to HTS to mitigate aberrant sequencing results caused by the presence of free adapters.
The phytoplasma genome sequences of 25 phytoplasma-positive samples have been used in previous studies [24,37,38]. The genome sequences and metadata associated with these samples were included in this study; however, they were gathered during the sample collection period of this study (Table S2).

Read quality filtering, metagenomic assembly, and identification of phytoplasma-derived contigs and gene annotations
Illumina read filtering and adapter trimming for each sample was done using FastP [43], removing reads shorter than 50 bp and reads with a Phred quality score (Q score) below Q20. The trimmed reads were used in a metagenomic assembly pipeline according to [38], which implements metaSPAdes version 3.15.2 [44,45]. Phytoplasma-derived contigs were identified and retrieved using BLAST+ v2.11.0 [42] and a custom grep script, respectively. Contigs shorter than 500 bp were removed using the reformat.sh script implemented in the BBMap v.38.61b software suite [46]. The phytoplasma genomes were analysed in metaQUAST [47] to estimate the genome N50 values. Protein coding, tRNA, and rRNA genes were annotated and counted using Prokka [48], specifying RNAmmer for 5S, 16S, and 23S rRNA gene annotations [49].
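The contig length filter and the N50 statistic described above can be illustrated with a minimal Python sketch. It is not the reformat.sh or metaQUAST implementation used in the study; the function names and the in-memory dictionary of contigs are assumptions made for illustration only.

```python
def filter_contigs(contigs, min_length=500):
    """Keep contigs of at least min_length bp (shorter contigs are removed)."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


def n50(lengths):
    """Return the N50: the contig length at which the cumulative sum of
    lengths, taken from longest to shortest, first reaches half of the
    total assembly size. Returns 0 for an empty assembly."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical toy assembly: contig names and sequences are illustrative only.
toy_contigs = {"contig_1": "A" * 1000, "contig_2": "A" * 499, "contig_3": "A" * 500}
kept = filter_contigs(toy_contigs)
assembly_n50 = n50(len(seq) for seq in kept.values())
```

In this toy case the 499 bp contig is discarded, and the N50 is computed only from the contigs that remain.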
The phytoplasma genomes obtained in this study were analysed differently based on the total genome size recovered to mitigate spurious results related to highly incomplete and fragmented genomes. The number of tRNA gene sequences annotated was used to estimate the completeness of the phytoplasma assemblies and, therefore, their suitability for the downstream genome-based analyses. Phytoplasma genomes that were larger than 300,000 bp and which encoded 13 or more tRNA genes were used in further whole-genome-based assessments to classify the phytoplasma strains (Table S1). For genomes smaller than 300,000 bp and/or encoding fewer than 13 tRNA genes, only the 16S rRNA gene was used to classify the phytoplasma strain within a 16Sr group (Table S1).
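The two genome criteria can be written as a small decision function. The sketch below is only an illustration of the rule as stated; the function and constant names are hypothetical, and the treatment of a genome of exactly 300,000 bp (not specified in the text) is an assumption.

```python
MIN_GENOME_BP = 300_000   # genomes must be larger than this (from the text)
MIN_TRNA_GENES = 13       # genomes must encode at least this many tRNA genes


def analysis_route(genome_bp, trna_genes):
    """Return 'whole-genome' when both criteria are met, else '16S-only'.

    Assumption: a genome of exactly 300,000 bp is treated as not passing,
    because the text only specifies 'larger than' and 'smaller than'.
    """
    if genome_bp > MIN_GENOME_BP and trna_genes >= MIN_TRNA_GENES:
        return "whole-genome"
    return "16S-only"


# Illustrative inputs; the pairings of size and tRNA count are hypothetical.
route_a = analysis_route(321_651, 13)   # passes both criteria
route_b = analysis_route(289_060, 21)   # too small, despite enough tRNA genes
route_c = analysis_route(650_000, 12)   # large enough, but too few tRNA genes
```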

Whole-genome analyses
Whole-genome comparisons were performed for all the phytoplasma genomes by average nucleotide identity (ANI) analysis using the ANI with MUMmer (ANIm) algorithm in pyani version 0.2.10 [50]. The pyani heatmap output was manually overlaid with blastn results to characterize clusters that did not contain a representative genome.
For further investigation into phytoplasma samples that did not clearly cluster with reference sequences in the initial ANI analysis, the coverage of aligned genomic segments in each pairwise comparison was analysed along with publicly available phytoplasma reference genomes using pyani version 0.2.10. These values are referred to as the alignment fraction (AF).
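To make the two quantities concrete, the following sketch computes an ANI and an AF value from a list of pairwise alignment blocks. It is a simplified illustration, not pyani's code: pyani derives these values from MUMmer output, and the block representation, function name, and example numbers here are assumptions.

```python
def ani_and_af(blocks, query_length):
    """Compute (ANI %, AF %) from alignment blocks.

    blocks: list of (aligned_length, identical_bases) tuples for the
            aligned segments between a query and a reference genome.
    query_length: total length of the query genome in bp.

    ANI is the percentage of identical bases over all aligned bases;
    AF is the percentage of the query genome covered by the alignments.
    """
    aligned = sum(length for length, _ in blocks)
    identical = sum(ident for _, ident in blocks)
    if aligned == 0 or query_length == 0:
        return 0.0, 0.0
    ani = 100.0 * identical / aligned
    af = 100.0 * aligned / query_length
    return ani, af


# Hypothetical example: two aligned segments on a 4,000 bp query.
ani, af = ani_and_af([(1000, 950), (1000, 990)], query_length=4000)
```

Reporting AF alongside ANI matters because a high ANI computed over a small aligned portion of the genome says little about overall relatedness.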
In cases where a historic strain was previously classified using in vitro RFLP but appeared to be a different species based on the maximum-likelihood tree, the recognition sites for 14 of the 17 restriction enzymes used in phytoplasma subgroup classification were visualized in Geneious Prime version 2022.2.2 (https://www.geneious.com/prime/). The three restriction enzymes that were not used, as they are unavailable in Geneious Prime version 2022.2.2, are BfaI, BstUI (ThaI), and SspI.

Identification of mixed phytoplasma infections using whole-genome analyses
Samples that demonstrated high ANI values (>90 % ANI) with more than one representative phytoplasma genome were considered to have a mixed phytoplasma infection of the phytoplasma taxa for which genome sequence data was available. The 16S rRNA gene sequence analyses, including those of the cloned PCR amplicons, as well as the number of annotated tRNA genes were revisited for the mixed infection samples identified during the blastn analyses.
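The mixed-infection rule can be sketched as a check on one sample's row of the ANI matrix. This is an illustration under stated assumptions rather than the procedure used in the study, where the heat map was inspected manually: the function names are hypothetical, the input is assumed to hold one representative genome per taxon, and the ANI values in the example are invented.

```python
def representatives_above(ani_row, threshold=90.0):
    """Return the representative genomes whose ANI with the sample exceeds
    the threshold (in %). ani_row maps representative name -> ANI value."""
    return sorted(name for name, ani in ani_row.items() if ani > threshold)


def is_mixed_infection(ani_row, threshold=90.0):
    """A sample is flagged as a putative mixed infection when it shows a
    high ANI with more than one representative genome."""
    return len(representatives_above(ani_row, threshold)) > 1


# Hypothetical ANI values for a single sample against three representatives.
sample_row = {
    "'Ca. Phytoplasma fabacearum'": 98.7,
    "'Ca. Phytoplasma australasiaticum subsp. ipomoeae'": 99.1,
    "'Ca. Phytoplasma stylosanthis'": 81.3,
}
flagged = is_mixed_infection(sample_row)
```

Because some closely related taxa can themselves share ANI values above 90 %, a flag from a rule like this is best treated as a prompt to revisit the 16S rRNA and tRNA evidence, as described above.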

Identification of unknown host species
The host family, genus, or species of some plant hosts and two insect hosts were unconfirmed based on external morphology prior to total nucleic acid extraction (Table S1). To confirm the identities of the insects to the species level, contigs encoding cytochrome c oxidase subunit 1 (coI) were analysed [51]. To confirm the family, genus, and/or species of the unknown plant hosts, the contigs encoding maturase K (matK) and ribulose-bisphosphate carboxylase (rbcL) genes were analysed [52,53]. These barcoding genes were obtained from the metagenome assemblies using a custom grep script to identify the genes and their identities from the blastn results. A publicly available accession was listed as the top blastn hit when it produced the highest bit-score, lowest e-value, and highest percentage identity. In cases where multiple species in the same genus were considered the top blastn hit, only the genus was recorded for the sample and the species names were considered undetermined. In cases where multiple genera were listed as top blastn hits, the family of these samples was recorded, and the genus and species names were considered undetermined. The detection locations of host species listed as top hits were determined based on searches performed at the Australasian Virtual Herbarium website (https://avh.chah.org.au/, last accessed March 2023), with hosts that are not known to be present in Australia removed from the list.
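The rule for deciding the taxonomic rank at which a host is recorded can be sketched as follows. This is an illustrative reading of the rule, not the custom script used in the study; the function name, the tuple layout, and the example hits are assumptions.

```python
def resolve_host(top_hits):
    """Decide the rank at which a host can be recorded from tied top hits.

    top_hits: list of (family, genus, species_epithet) tuples for the
              accessions that tie as the top blastn hit.
    Returns (rank, name): the species when all hits agree on it, the genus
    when several species of one genus tie, the family when several genera
    of one family tie, otherwise ('undetermined', None).
    """
    if not top_hits:
        return ("undetermined", None)
    families = {family for family, _, _ in top_hits}
    genera = {genus for _, genus, _ in top_hits}
    species = {(genus, epithet) for _, genus, epithet in top_hits}
    if len(species) == 1:
        genus, epithet = next(iter(species))
        return ("species", f"{genus} {epithet}")
    if len(genera) == 1:
        return ("genus", next(iter(genera)))
    if len(families) == 1:
        return ("family", next(iter(families)))
    return ("undetermined", None)


# Hypothetical tied hits, used only to exercise each branch of the rule.
genus_level = resolve_host([("Malvaceae", "Sida", "rhombifolia"),
                            ("Malvaceae", "Sida", "fallax")])
family_level = resolve_host([("Asteraceae", "Chromolaena", "odorata"),
                             ("Asteraceae", "Praxelis", "clematidea")])
```

The subsequent filtering of candidate species by their recorded occurrence in Australia is a separate, manual step and is not modelled here.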

Phytoplasma-positive sample information
A total of 195 samples were collected between 1998 and 2022 and were either confirmed or suspected to be infected with phytoplasma based on previous PCR-based analyses (sequence similarity and RFLP) or disease symptoms. These samples were subsequently confirmed to be positive for phytoplasma by PCR and direct Sanger sequencing of the amplicon generated using the universal phytoplasma 16S rRNA primers. Most of the samples were collected from QLD (n=78), followed by Western Australia (WA, n=46), NT (n=32), NSW (n=29), and VIC (n=10) (Tables 1 and S1).
Phytoplasma samples analysed in this study were collected from all the states and territories of Australia, apart from the Australian Capital Territory (ACT), South Australia (SA), and Tasmania (TAS) (Tables 1 and S1). No samples were collected between 2019 and 2022 in SA due to COVID-19 travel restrictions. There was also an absence of plants showing typical phytoplasma symptoms in TAS during the 2019-2022 collection period (Callum R. Wilson, personal communication), which corresponds to previous observations of low phytoplasma prevalence for the state [54].
An asterisk (*) in the sample name(s) column indicates samples for which the host identity was determined using DNA barcode analysis.

Metagenomic sequencing output and phytoplasma genome information
After metagenomic HTS, 178 of the 195 total samples passed the phytoplasma genome criteria for further comparative genomic analyses in this study (Table S1). The Illumina sequence data output for these 178 samples ranged from 0.31 Gb (sample BAWM-193a-F1) to 32.60 Gb (sample BAWM-354A) with an average output of 4.76 Gb per metagenomic library (Table S1). The phytoplasma genome sizes of these 178 samples ranged between 321 651 bp (sample BAWM-201) and 1 488 020 bp (sample BAWM-255), with an average genome size of 632 634 bp for all 178 phytoplasma samples (Table S1). Of the 178 draft genomes, an average of 28 tRNA gene sequences were recovered per genome, ranging from 13 tRNA genes (sample BAWM-198) to 61 tRNA genes (sample BAWM-255) (Table S1). The largest number of tRNA genes recovered from a complete phytoplasma genome to date is 35 from the genome of 'Ca. Phytoplasma australiense' strain NZSb11 [57], indicating that the phytoplasma genome data of 13 of the 178 samples that had more than 35 tRNA genes annotated may represent tRNAs of more than one phytoplasma species in the sample (Table S1). An average of two phytoplasma rRNA genes could be annotated from these 13 genomes, with a maximum of five rRNA genes (sample BAWM-307) and none obtained from sample BAWM-350 (Table S1). To date, two identical or nearly identical 16S rRNA genes are known to be encoded per phytoplasma genome [1,36], indicating that sample BAWM-307 potentially harbours a mixed phytoplasma infection.
Phytoplasma genome sequences that were <300 000 bp were recovered for 17 samples in six different host families (Table S1). These genome sequences ranged in size from 1784 bp (sample BAWM-004) to 289 060 bp (sample BAWM-184), with an average size of 114 181 bp (Table S1). The average data output for these 17 samples was 5.07 Gb, with a range of 1.28 Gb (sample BAWM-173) to 22.53 Gb (sample BAWM-189) (Table S1). An average of six tRNA genes could be retrieved from these 17 phytoplasma genomes (range 0 tRNA genes for samples BAWM-003, BAWM-004, BAWM-083, BAWM-216, and BAWM-233 to 21 tRNA genes for BAWM-183). No rRNA genes were retrieved from seven of the 17 phytoplasma genomes. These samples were not used in further genomic-based analyses as these results indicate poor-quality genomes from which limited information can be obtained, including the 16S rRNA gene-based correlation of taxon identification prior to and after metagenomic HTS [58].

Subspecies of 'Ca. Phytoplasma australasiaticum'
Based on whole-genome ANI analyses (Fig. 1; Table S1), 160 of the 178 samples (ca. 90 %) used in further genome-based analyses clustered at >96 % ANI solely with representative genome sequences of 16SrII phytoplasmas. The majority of these samples classified within the 16SrII phytoplasma group were identified as 'Ca. Phytoplasma australasiaticum' subspecies, including 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' (n=67), 'Ca. Phytoplasma australasiaticum subsp. ipomoeae' (n=51), and strains identified as a new 'Ca. Phytoplasma australasiaticum' subspecies (n=12, referred to as 'Ca. Phytoplasma australasiaticum' taxon 1) (Fig. 1). When the twelve 16S rRNA sequences of the 'Ca. Phytoplasma australasiaticum' taxon 1 samples extracted from the genomic sequences were queried further, the historic samples were classified as 'TBB' (i.e. 'Ca. Phytoplasma australasiaticum') in the NT DITT phytoplasma database and their 16S rRNA genes shared the highest nucleotide sequence similarity and coverage with the 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' representative strain PR08 (99.92 % nucleotide sequence identity, 100 % coverage) in blastn analyses (Table 1). Additionally, these sequences could only be differentiated from 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' and 'Ca. Phytoplasma australasiaticum subsp. ipomoeae' during in silico RFLP analysis by the HaeIII restriction enzyme (Fig. S1). Therefore, the identification and subsequent characterization of this 'Ca. Phytoplasma australasiaticum' subspecies was likely missed in previous analyses as the HaeIII restriction enzyme was infrequently used during in vitro RFLP analyses [13,14,16,59]. These results illustrate the low resolution of RFLP analysis of the 16S rRNA gene sequence to delimit separate phytoplasma taxa compared to the species and subspecies characterization that is possible using the genome ANI, which has also been emphasized in previous studies [36].
Together, these results suggest that the previously unrecorded taxa, 'Ca. Phytoplasma australasiaticum' taxon 1 to taxon 5, are endemic to Australia. This is supported by 'Ca. Phytoplasma australasiaticum' taxon 1 being present, but misclassified, in a historic sample collected in 2004 (Table S1), and also because these phytoplasma taxa have not been detected in any other country to date based on the 16S rRNA sequences and the limited number of 16SrII phytoplasma genomes that are publicly available. Using the genome-sequence data obtained for these strains in this study, further analyses are required to confirm whether these five new taxa are truly distinct subspecies of 'Ca. Phytoplasma australasiaticum' and not artefacts generated during metagenomic sequencing and analyses [36,37].

A putatively new species within the 16SrII phytoplasma group
The phytoplasma strains obtained from samples BAWM-167 and BAWM-339 shared 100 % ANI and >80 % AF with each other and approximately 94 % ANI and <80 % AF with any other phytoplasma genome used in this study, including the closely related subspecies of 'Ca. Phytoplasma australasiaticum' (Fig. 1b, c). These results indicate that the phytoplasma strains from samples BAWM-167 and BAWM-339 may represent a novel 'Ca. Phytoplasma' species within the 16SrII group. The 16S rRNA gene analyses support the divergence of these two strains compared to other previously described 16SrII phytoplasmas. The 16S rRNA genes of BAWM-167 and BAWM-339 had 99.92 % sequence similarity to the reference sequence of 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' (100 % coverage, 2180-2211 total score) (Table S1). Further phylogenomic analyses are required to confirm whether these three taxa are truly distinct subspecies and not due to artefacts generated during metagenomic sequencing and analyses. However, the identification of this putative novel species from two distinct hosts (Medicago sativa and Ipomoea sp.) across a large geographic separation may provide positive support for the existence of this species.
This species could be endemic to Australia: based on the 16S rRNA sequences, these two detections are the first records of this species globally, and they were made in two geographically distinct areas in Australia (Darwin, NT and Dandaragan, WA) in different hosts and years.
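The reasoning used in this section to flag a putative novel species can be sketched as a threshold check on ANI and AF. The 95 % ANI and 80 % AF values are taken from how they are used in this paper; combining them with a logical "and", the handling of values exactly at the thresholds, the function names, and the example values are all assumptions for illustration.

```python
ANI_SPECIES_THRESHOLD = 95.0  # % ANI used here as the species boundary
AF_THRESHOLD = 80.0           # % alignment fraction used alongside ANI


def same_species(ani, af):
    """Assumed rule: two genomes belong to the same species when both the
    ANI and the AF reach their thresholds."""
    return ani >= ANI_SPECIES_THRESHOLD and af >= AF_THRESHOLD


def matching_references(comparisons):
    """comparisons maps a reference name to its (ANI, AF) pair against the
    query genome; return the references that pass the same-species rule."""
    return sorted(name for name, (ani, af) in comparisons.items()
                  if same_species(ani, af))


def is_putative_novel_species(comparisons):
    """A query is a candidate novel species when no reference passes."""
    return not matching_references(comparisons)


# Hypothetical comparison values, loosely modelled on the pattern described
# above (about 94 % ANI and below 80 % AF with the closest described taxon).
query_vs_references = {
    "'Ca. Phytoplasma australasiaticum subsp. australasiaticum'": (94.0, 75.0),
    "'Ca. Phytoplasma fabacearum'": (91.5, 60.0),
}
novel = is_putative_novel_species(query_vs_references)
```

A result from such a check is only a first indication; as noted above, further phylogenomic analyses are needed before a new species can be described.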

Other phytoplasma taxa identified in this study, which have previously been recorded in Australia
Strains of the 16SrII species, 'Ca. Phytoplasma fabacearum', formed the third largest cluster (n=16) in the ANI analyses, while the cluster containing the group 16SrXXXVIII species 'Ca. Phytoplasma stylosanthis' was the fifth largest (n=6) (Fig. 1). The single WaLL phytoplasma (group 16SrII) strain from BAWM-227 showed between 90 and 95 % ANI with four 16SrII phytoplasma species previously determined to be closely related to each other [37], namely 'Ca. Phytoplasma citri', 'Ca. Phytoplasma asiaticum', 'Ca. Phytoplasma gossypii', and 'Ca. Phytoplasma crotalariae' (Fig. 1). The two 'Ca. Phytoplasma bonamiae' strains (group 16SrII) formed their own cluster with 100 % ANI between them in the pairwise analysis and shared the next highest ANI with strains of 'Ca. Phytoplasma fabacearum' (ANI of <94 %, Fig. 1). The two 'Ca. Phytoplasma melaleucae' strains (group 16SrXXV) formed their own cluster with 100 % ANI between them in the pairwise analysis (Fig. 1). The ViLL phytoplasma strains (16Sr group unassigned) clustered with each other, but with a pairwise ANI of ca. 97 % to each other.
*The previous identification of BAWM-227 as a WaLL phytoplasma based on RFLP in the NT DITT phytoplasma database was used as no full-length 16S rRNA gene exists for this taxon [13]. However, the 16S rRNA gene of this phytoplasma shared 99.65 % nucleotide identity with that of 'Ca. Phytoplasma asiaticum'.
†Mixed infection was identified for these samples by Sanger sequencing cloned 16S rRNA PCR amplicons.
The blastn analyses of the 16S rRNA genes of all these taxa supported the ANI results (Fig. S1A), and the WaLL phytoplasma was characterized based on RFLP in previous analyses (NT DITT record; Table 2) [13]. Further, the ANI results of WaLL and ViLL phytoplasmas suggest that these taxa could be described as two novel 'Ca. Phytoplasma' species (ANI <95 % with any other genome available for described phytoplasma species). Future work is required to determine whether the WaLL and ViLL phytoplasmas meet the updated requirements for the description of novel 'Ca. Phytoplasma' species [31] and whether the two ViLL strains represent two individual subspecies. Additionally, the competent insect vector species of the WaLL and ViLL taxa remain to be determined.

The identification of mixed phytoplasma infections
Close analysis of the ANI heat map revealed evidence of mixed phytoplasma infections (Fig. 1), where several samples showed a high ANI with two representative genomes. Samples BAWM-044, BAWM-186, and BAWM-316 had mixed infections comprising 'Ca. Phytoplasma fabacearum' and 'Ca. Phytoplasma australasiaticum subsp. ipomoeae'. The blastn analyses of the 16S rRNA genes obtained for these samples identified the presence of 'Ca. Phytoplasma australasiaticum subsp. ipomoeae' but failed to identify 'Ca. Phytoplasma fabacearum' (Table S1).
When revisiting the number of tRNA gene sequences annotated from these 12 samples identified to contain mixed phytoplasma infections, more than 35 tRNA gene sequences were identified from all of these samples apart from sample BAWM-307, from which 32 tRNA genes were obtained (Table S1). Further, only one sample for which no mixed phytoplasma infection was identified using the ANI approach encoded more than 35 tRNA genes (BAWM-311, n=40 tRNA genes). These results highlight that the metagenomic sequencing, assembly, and tRNA annotation approach used in this study can sufficiently resolve the distinct tRNA genes encoded by each phytoplasma species and, thus, support the utility of using the tRNA count as an indicator of mixed phytoplasma infections in a sample in addition to the genome completeness criteria that have been proposed previously [60,61].
The results of species identification based on the 16S rRNA gene and whole-genome comparisons emphasize several important implications of these approaches to phytoplasma identification. Firstly, the 16S rRNA gene sequences obtained by either method often only represented one of the phytoplasma taxa involved in the mixed infection. It is likely that this arose due to differences in the titres of the multiple phytoplasma taxa in the sample, with only the one gene sequence being obtained by direct Sanger sequencing or more genomic sequence data obtained from the phytoplasma present at the higher titre. Alternatively, the multiple 16S rRNA gene sequences of closely related phytoplasma taxa obtained in a sample may have been missed during the process of obtaining the consensus sequence from both Sanger sequencing and metagenomic HTS data.
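The comparison between the tRNA-count indicator and the ANI-based calls can be sketched as below. The threshold of 35 tRNA genes and the values for BAWM-307 and BAWM-311 come from the text; the function names, the data layout, and the two additional samples are hypothetical and included only to exercise the code.

```python
MAX_TRNA_SINGLE_GENOME = 35  # most tRNA genes reported from one complete genome


def exceeds_single_genome_trna(trna_count):
    """Flag a draft genome whose tRNA count is higher than expected for a
    single phytoplasma genome, hinting at a mixed infection."""
    return trna_count > MAX_TRNA_SINGLE_GENOME


def compare_flags(samples):
    """samples maps sample name -> (tRNA count, mixed infection by ANI).

    Returns the samples where both indicators agree, those flagged by the
    tRNA count only, and those flagged by ANI only.
    """
    result = {"agree": [], "trna_only": [], "ani_only": []}
    for name, (trna_count, ani_mixed) in sorted(samples.items()):
        trna_mixed = exceeds_single_genome_trna(trna_count)
        if trna_mixed == ani_mixed:
            result["agree"].append(name)
        elif trna_mixed:
            result["trna_only"].append(name)
        else:
            result["ani_only"].append(name)
    return result


samples = {
    "BAWM-307": (32, True),         # mixed by ANI, 32 tRNA genes (from the text)
    "BAWM-311": (40, False),        # 40 tRNA genes, not mixed by ANI (from the text)
    "hypothetical-A": (44, True),   # invented example
    "hypothetical-B": (28, False),  # invented example
}
summary = compare_flags(samples)
```

Listing the disagreements separately makes it easy to see which samples need a closer look with the 16S rRNA and ANI evidence.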
Phytoplasma identifications made for samples that could not be used in ANI analyses
For the samples for which insufficient phytoplasma data was obtained for ANI analyses, the 16S rRNA sequences shared high sequence similarity with 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' (n=9 sequences), 'Ca. Phytoplasma australasiaticum subsp. ipomoeae' (n=7 sequences), 'Ca. Phytoplasma australiense' (n=1 sequence, BAWM-189), and the ViLL phytoplasma (n=1 sequence, BAWM-337) (Table S1). These results highlight a pitfall of applying metagenomic-based approaches to identify the species diversity of phytoplasmas obtained from a diversity of host species. Specifically, it may be difficult to obtain sufficient data from metagenome sequencing of hosts that harbour low-titre infections for comparison with other taxa. In contrast, PCR of the 16S rRNA gene enriches for regions of interest that are taxonomically informative [62], albeit at a low taxonomic resolution [37]. This highlights the need to sequence additional genomic regions in diversity studies, and/or to have pre-sequencing enrichment tools for phytoplasma cells or DNA to improve genome sequence retrieval and assembly for genomic analysis relevant for applications such as taxonomy [38,63-65].

Host and geographic ranges of the phytoplasma taxa identified in this study

Identification of unknown hosts
Eight weed samples were either unknown or not confidently identified to family, genus, or species based on visual identification. The combination of two barcodes, matK and rbcL, extracted from metagenomic data, as well as occurrences of the identified species in the geographic region of collection, as recorded in the Australasian Virtual Herbarium, were used to indicate the host plant species (Table 3).
Two plants visually identified as members of the Lamiaceae were identified as Asteraceae members, including Chromolaena odorata (BAWM-316) and an undetermined species likely in the Chromolaena or Praxelis genera (BAWM-321) (Tables 3 and S1). One weed sample (BAWM-319) was also visually identified as a member of the Lamiaceae but was instead identified as a member of the Malvaceae (Sida sp.) based on the DNA barcodes (Table S1). The host species of BAWM-319 is likely Sida rhombifolia as this species is present in QLD, where the sample was collected, whereas the Australasian Virtual Herbarium records for Sida fallax are from WA. Species in the Sida genus have been reported as hosts in Australia previously [12]. The results for these three samples are consistent with other phytoplasma detections made in Australia, where members of the Lamiaceae are not known to host phytoplasmas but several Asteraceae and Malvaceae species have been recorded as hosts [27].
Samples BAWM-037, BAWM-057, BAWM-079, and BAWM-302 were all recorded as 'unknown weed' species based on morphological observations (Tables 3 and S1). Samples BAWM-057 and BAWM-302 were subsequently identified to the genus level based on blastn of the matK and rbcL gene sequences (both Solanum spp., Solanaceae). The host of BAWM-079 was also identified as a Solanum sp. (Solanaceae) by the host DNA barcode analyses (Table 3, either Solanum nigrum or Solanum rostratum). It is likely that the host species of BAWM-079 is Solanum nigrum as this plant species has a wide geographic distribution in Australia based on the Australasian Virtual Herbarium, including in the NT where BAWM-079 was sampled, and due to the high blastn results for this gene with Solanum nigrum (228 total score, 5E-60 e-value, 99.2 % identity; GenBank accession number: M588530). Solanum rostratum is not present in the NT according to the Australasian Virtual Herbarium. However, the rbcL gene sequence had higher blastn scores with Solanum rostratum and this species was, therefore, recorded in Table 3. Based on blastn of the matK and rbcL genes obtained from the metagenomic data, the host of BAWM-037 was identified as a Sida sp. (Malvaceae).
The host species of BAWM-037 could not be determined due to the inconsistencies between the top hit species listed for the two DNA barcodes (Table 3).
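The way two barcode top hits are reconciled in the preceding paragraphs (a species-level call only when both barcodes agree, otherwise a fallback to the shared genus) can be sketched as below. This is an illustrative assumption about the decision rule, not the authors' documented procedure. The function name is hypothetical, and the example assumes that the Solanum nigrum hit for BAWM-079 came from matK, since the rbcL hit is stated to favour Solanum rostratum.

```python
def consensus_host(matk_top_hit: str, rbcl_top_hit: str) -> str:
    """Reconcile the top blastn hits of two plant barcodes into one host call."""
    matk_genus = matk_top_hit.split()[0]
    rbcl_genus = rbcl_top_hit.split()[0]
    if matk_top_hit == rbcl_top_hit:
        return matk_top_hit          # both barcodes support the same species
    if matk_genus == rbcl_genus:
        return f"{matk_genus} sp."   # species disagree, genus agrees
    return "unresolved"              # barcodes disagree at the genus level


# BAWM-079, assuming matK -> Solanum nigrum and rbcL -> Solanum rostratum.
call = consensus_host("Solanum nigrum", "Solanum rostratum")
print(call)
```

A rule of this kind deliberately ignores geographic occurrence records, which the study used as an additional line of evidence when suggesting the likely host species.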
Two insect hosts were identified as Orosius sp. based on their external morphology (samples BAWM-342B and BAWM-343A).
Using sequence analyses of their coI gene, both samples were identified as Orosius argentatus. Based on several studies, Orosius argentatus is a known phytoplasma vector in Australia and is detected across a broad geographic range in Australia [9,29,30,66].
These sequence-supported identifications of plant or insect hosts at the species, genus, or family level, when they were not known based on visual inspection, highlight the added benefit of a metagenomic-based approach to investigating phytoplasma diversity and their host associations. However, the host species listed using this approach are considered preliminary indications of the host taxa sampled, especially when (i) the nucleotide identities of the DNA barcodes were not identical to those of voucher specimens on the NCBI, despite the nucleotide identities being above 90 % in all cases in this study (Table 3), (ii) recording species-level identifications, and (iii) considering that some barcodes may be missing for the species under investigation while being available for a closely related species [67,68]. This is due to the limitations of the available and well-validated plant DNA barcodes in the public databases.
Of the 24 phytoplasma-positive weed samples, 16 were unidentified to the genus or species level based on morphology or DNA barcode analysis (Tables 3 and 4).
Orosius species, including Orosius argentatus, have been identified as confirmed or putative vectors of diseases thought to be associated with 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' and 'Ca. Phytoplasma stylosanthis', such as tomato big bud, tobacco little leaf, and legume little leaf diseases [15,27,29]. The detection of 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' and 'Ca. Phytoplasma stylosanthis' in Orosius argentatus in this study (Tables 1 and S1) is therefore consistent with the results of previous collection efforts in Australia [15], and it may be that this leafhopper species is a vector of several phytoplasma taxa. However, the detection of phytoplasma strains from total nucleic acid extracts of insect whole bodies or subsampled body sections, as done in this study, does not provide definitive evidence of vector competence, and further transmission trials are required. A comprehensive list of insect species that serve as competent vectors of phytoplasma diseases of vegetable crops in Australia remains to be determined.
The observation that 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' and 'Ca. Phytoplasma australasiaticum subsp. ipomoeae' were the most frequently detected, the most widespread geographically, and detected from the most host families and species in this study is not surprising (Figs 1-3). These two subspecies have historically been detected from a broad range of hosts in Australia, including many crop species [27]. 'Ca. Phytoplasma australasiaticum subsp. australasiaticum' and 'Ca. Phytoplasma australasiaticum subsp. ipomoeae' are major threats to crop production in Australia as they are commonly associated with crop species in many vegetable growing areas around the country, and disease incidences can be high [72]. Additionally, the symptoms they are associated with directly affect crop yield (e.g. phyllody). Further research is required to determine whether these two subspecies have distinct host ranges, symptomologies, or vector species for a better understanding of their biology and how to mitigate outbreaks of diseases associated with them.
This taxon may present a moderate threat to crop production in Australia, with the potential to affect Solanaceae hosts in particular. This is due to this taxon being detected from several crop hosts displaying symptoms that directly affect crop yield but also due to the large geographic range of the detections made in this study (from the NT, QLD, and VIC). Additionally, 'Ca. Phytoplasma australasiaticum' taxon 1 was likely missed in previous RFLP-based analyses done to assess taxon diversity in Australia and may, therefore, have a broader host and geographic range than what is reported in this study. Further research is required to investigate the prevalence and vector(s) of this taxon.
The ANI results indicated that these taxa were distinct and are likely novel subspecies of 'Ca. Phytoplasma australasiaticum' (Fig. 1), representing the first time these taxa have been identified in Australia and likely globally. These taxa were potentially missed previously because of the high sequence similarity of their 16S rRNA genes to those of other described 16SrII phytoplasma taxa (>99 % nucleotide sequence similarity; Table S1). Additionally, these detections may not have been made in the past as four of these novel phytoplasma taxa were detected from either weed species or from crop species that are first records of phytoplasma hosts for Australia (Table 4).
The potential threat of these phytoplasma taxa remains to be determined as only a few detections in plant hosts were made in this study and their detections were likely missed in previous low-resolution (RFLP-based) analyses (Table S1). Additionally, no vector species have been identified for these taxa. These results highlight the importance of sampling weed species in and around cropping areas, as well as collecting diverse species of symptomatic hosts in an area. These taxa need to be assessed further to determine whether they are truly distinct subspecies, which can be done using further comparative and phylogenomic assessments in future [36,37].
These results highlight that Fabaceae crops across a broad geographic range in Australia are at a high risk of losses due to infection by 'Ca. Phytoplasma fabacearum', although some Asteraceae, Cucurbitaceae, and Solanaceae hosts might also be at risk. This is also supported by reports of high incidence of phytoplasma diseases in Australia likely attributed, in part, to 'Ca. Phytoplasma fabacearum' [72].
Historic samples: 'Ca. Phytoplasma bonamiae' and the Waltheria little leaf phytoplasma (16SrII)
Strains of 'Ca. Phytoplasma bonamiae' (n=2) identified from Bonamia pannosa and the WaLL phytoplasma (n=1) identified from a Waltheria sp. were only identified from the historically collected samples analysed in this study from QLD and the NT, respectively (Figs 2 and 3b, c; Table S1). No new host species or geographic range expansions are, therefore, reported for these taxa. 'Ca. Phytoplasma bonamiae' was associated with little leaf symptoms in both samples, while the WaLL phytoplasma was associated with both little leaf and witches'-broom symptoms. No insect vector species have been identified for these phytoplasma taxa.
This study provides a full-length sequence of the 16S rRNA gene as well as genomic data for the WaLL phytoplasma for the first time. This sequence data identified this phytoplasma as a member of the 16SrII group (Figs 1 and S1; Table S1), which confirms previous reports based on nucleotide analysis of regions within the 16S rRNA gene [13]. Additionally, 16S rRNA gene and ANI sequence analyses showed that 'Ca. Phytoplasma bonamiae' and the WaLL phytoplasma were close relatives of the 'Ca. Phytoplasma fabacearum' strains (Figs 1 and 2) [13]. However, since both these phytoplasma taxa are infrequently detected in crop plants, and since 'Ca. Phytoplasma bonamiae' has only been detected from the Australian native plant Bonamia pannosa (based on 14 samples in the NT DITT database and [15,27]), they are both unlikely to pose a major threat to crop production in Australia.
It is likely that the WaLL phytoplasma strain can be described as a novel species as it shares less than 96 % ANI with other phytoplasma species (Fig. 1). Eight WaLL strains are in the NT DITT database and future work can investigate whether these WaLL phytoplasma strains further fulfil the updated guidelines to be described as a novel 'Ca. Phytoplasma' species should more sequence data be made available for them [31].
'Ca. Phytoplasma planchoniae' (16SrII)
'Ca. Phytoplasma planchoniae' was detected from a Planchonia careya host sampled in QLD (Figs 2 and 3b) that displayed little leaf and witches'-broom symptoms (Fig. 3c). 'Ca. Phytoplasma planchoniae' has previously been detected in Australia and has only been associated with the native plant Planchonia careya in far north QLD [73]. Due to its narrow host range in a non-crop species and its restricted geographic range, 'Ca. Phytoplasma planchoniae' is unlikely to pose a major threat to crop production in Australia.

Potentially new 16SrII species
Strains of the potentially new 16SrII species were detected from an Ipomoea sp. (Convolvulaceae, sample BAWM-339) and from a Medicago sativa sample (Fabaceae, sample BAWM-167) from the NT and WA, respectively (Figs 2 and 3b; Table S1). Both hosts showed symptoms of little leaf and witches'-broom, but the Medicago sativa sample BAWM-167 also showed symptoms of yellowing (Fig. 3c and Table S1). While further investigations are required to determine whether these two strains belong to a novel 'Candidatus Phytoplasma' species within the 16SrII group, support for their delimitation includes the observations that (i) they produced ANI and AF values below the within-species thresholds (<95 % and <80 %, respectively) with already described 'Candidatus Phytoplasma' species, and (ii) more than one strain of this potential species was identified from distinct hosts from different areas in Australia. It remains to be determined what threat this taxon presents to crop production in Australia, as only these two strains were identified in this study and the competent or putative vector species are unknown.
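The threshold reasoning in point (i) can be illustrated with a short sketch. It assumes that a strain is treated as belonging to a described species only when both the ANI (at least 95 %) and the AF (at least 80 %) thresholds are met, which is one plausible way of combining the two criteria rather than a rule stated in this study. The function names and the pairwise values are hypothetical.

```python
from typing import Dict, Tuple

ANI_THRESHOLD = 95.0  # percent ANI
AF_THRESHOLD = 80.0   # percent alignment fraction


def within_species(ani: float, af: float) -> bool:
    """Assumed rule: same species only if both thresholds are met."""
    return ani >= ANI_THRESHOLD and af >= AF_THRESHOLD


def is_candidate_novel_species(comparisons: Dict[str, Tuple[float, float]]) -> bool:
    """True when no described species meets both thresholds for this strain."""
    return not any(within_species(ani, af) for ani, af in comparisons.values())


# Hypothetical (ANI, AF) pairs against described species; not values from Fig. 1.
strain_vs_described = {
    "described_species_1": (91.2, 62.0),
    "described_species_2": (93.8, 74.5),
}
novel = is_candidate_novel_species(strain_vs_described)
print(novel)
```

Passing this check would only make a strain a candidate; as noted above, further investigation is required before a novel species can be described.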
Since 'Ca. Phytoplasma stylosanthis' had previously only been reported in the NT, QLD, and NSW [13,18,74], the Solanum tuberosum sample represents both a host and geographic range expansion for this phytoplasma (Table S1), as described previously (sample VPRI 43683 [24]). Since this phytoplasma has been identified from a broad range of crop species across a large geographic area, 'Ca. Phytoplasma stylosanthis' has the potential to be associated with losses in economically important crops such as Carica papaya [15,27].
Group 16SrXXV phytoplasma samples - 'Ca. Phytoplasma melaleucae'
'Ca. Phytoplasma melaleucae' was detected in QLD (n=1; BAWM-155) and WA (n=1; BAWM-354) (Fig. 2) and was only detected as single infections from Melaleuca spp. (Myrtaceae) in Australian regions above the Tropic of Capricorn (Tables 1 and S1). This is the first report of 'Ca. Phytoplasma melaleucae' for WA and the furthest west occurrence of this phytoplasma. Prior to the present study, this phytoplasma was only reported from far north QLD and the Western Province of Papua New Guinea in Melaleuca spp., with one case reported for Synsepalum dulcificum (Sapotaceae) [16,37]. The two samples analysed in this study displayed little leaf and witches'-broom symptoms (Fig. 3c and Table S1), which is consistent with previous detections [16]. At present, this phytoplasma is unlikely to pose a major threat, if any, to vegetable crop production in Australia as its host range is restricted to non-crop hosts.

Phytoplasma 16Sr group unassigned - Vigna little leaf phytoplasma
The ViLL phytoplasma was detected in the NT and WA (n=1 per state/territory; Fig. 2). Sample BAWM-245 represents a host and geographic range expansion for the ViLL phytoplasma, being detected for the first time in WA and in a Catharanthus roseus sample (Apocynaceae; Table S1; [27]). A second host expansion for this phytoplasma and new phytoplasma host for Australia was Momordica charantia (Cucurbitaceae), detected in the NT (sample BAWM-336) where there was a high incidence of disease (70-80 % of crop affected, in-field observation by S. Bond). Prior to this study, this phytoplasma taxon was only reported in Australia from within or near Katherine and Darwin in the NT [13,15,74]. Both samples showed little leaf symptoms, however symptoms were recorded (Fig. 3a). The most symptom types (n=12) were recorded for the plants classified within the Solanaceae, followed by those in the Asteraceae and Fabaceae plant families (n=5 each). No symptoms were provided for 12 samples in this study (Fig. 3a; Table S1).
No additional symptoms were observed for the mixed infection compared to samples where only a single phytoplasma taxon was observed (Fig. 3c; Table S1). While these associations need to be investigated more thoroughly as only a few host species overlapped between the single and mixed infection samples in this study, these observations have been reported previously [74].
The association of all plant symptoms with phytoplasma infection is, however, difficult to disentangle as phytoplasmas remain to be cultured and, as such, Koch's postulates cannot be fulfilled. Abiotic factors, herbicide treatments, insect damage, the presence of other microbes, viruses, or a combination of these factors could also contribute to the symptoms that are presented by the plant hosts [79,80].

CONCLUSIONS
In this study, phytoplasma-infected crop and non-crop hosts from historic collections and contemporary collections (2015 to 2022) from vegetable growing regions around Australia were metagenomically sequenced to identify the crop-infecting phytoplasma taxa and potential alternative hosts. A total of 15 distinct phytoplasma taxa were identified from the metagenomic data obtained for these hosts (Figs 1 and 2). 'Ca. Phytoplasma australasiaticum' subspecies and 'Ca. Phytoplasma stylosanthis' were two of the most frequently detected taxa identified, and from the broadest range of hosts and locations sampled across Australia (Fig. 2). Additionally, six previously undescribed phytoplasma taxa were identified from the samples analysed in this study, namely 'Ca. Phytoplasma australasiaticum' taxa 1 to 5 and a potentially new 16SrII phytoplasma species. A few phytoplasma taxa were infrequently detected in this study, with some only associated with diseases in non-crop plants; these, therefore, likely pose a low threat to crop production in Australia (e.g. 'Ca. Phytoplasma melaleucae'). Five different phytoplasma mixed infections were also identified (Figs 1 and 2). An updated list of phytoplasma 16Sr groups, species, subspecies, and unclassified taxa present in Australian vegetable growing areas, as well as the prevalence and combinations of mixed phytoplasma infections, was therefore provided by this study. Lists of symptoms per host (Fig. 3a) and per phytoplasma taxon (Fig. 3c) are also provided and, with the previous literature, will aid in-field detections of phytoplasma-associated disease in crop and non-crop plant hosts (Table S1).
PCR of the 16S rRNA gene using universal nested phytoplasma primers combined with direct Sanger sequencing was sufficient as a triage tool to screen and provide a preliminary identification of the phytoplasma taxon present in every sample analysed in this study. However, it lacked the taxonomic resolution afforded by the ANI analysis of draft metagenome-assembled phytoplasma genomes (Fig. S1), supporting results from other studies [36]. Additionally, the PCR-based approach often failed to accurately identify mixed infections (Table S1), which has been reported previously for Sanger sequencing of the PCR amplicon obtained directly from a sample [74]. The metagenomic-based approach employed in this study based on whole-genome ANI, however, was able to resolve strains to the subspecies level and could identify the presence of a mixed phytoplasma infection in a single sample. An additional benefit of using the metagenomic approach during phytoplasma collection was that it allowed for host taxa to be identified through the use of genetic barcodes present in the metagenomic dataset when they could not be resolved to the family, genus, or species level based on morphology (Table 3). Together, these results provided more informative data with a more precise assessment of the prevalence and host range of phytoplasmas in vegetable growing regions in Australia compared to previous studies, which could only use RFLP or sequence analysis of the 16S rRNA gene [12,13,15,74]. The results presented in this study highlight the benefits of combining metadata (host, location, date, etc.) and metagenomic sequencing for phytoplasma diversity assessments and to understand their epidemiology.
Sufficient phytoplasma genomic data was obtained for 178 (12 mixed infections) of the 195 symptomatic samples for genome-based sequence analyses and to be submitted to public sequence repositories (excluding mixed infection samples). The dataset presented here is the largest contribution of phytoplasma genome sequences from a single study to date, increasing the number of publicly available sequences from 47 [81] to a total of 213 (when excluding samples with mixed phytoplasma infections). The incomplete and draft phytoplasma genomes sequenced in this study have significantly increased the taxon sampling of subclade II, which is one of three subclades described in [82] (Fig. S1). The work presented here was possible due to the ever-decreasing cost of HTS and the increased volume of sequence data generated. The phytoplasma genome data obtained in this study can be used in future research to improve phytoplasma taxonomy and diagnostics, and will assist in genomic epidemiology analyses. The reliable genome sequence assemblies will also serve as a resource from which genes involved in symptomology and host/vector interactions can be investigated when combined with the appropriate metadata, as well as comparative and functional analyses.
Together, this genome resource will contribute significantly to the knowledge of phytoplasma biology and ecology, and can be used to inform management practices to help mitigate or prevent losses associated with major phytoplasma outbreaks in Australia.

Fig. 1. Whole-genome comparisons for phytoplasma genome sequence data obtained for 178 samples. (a) ANI heatmap, generated by pyani version 0.2.10 using the ANIm algorithm, for all strains sequenced in this study alongside representative and publicly available genomes. Some clusters are highlighted using brackets. (b) ANI percentages and (c) alignment fractions (AF) in each pairwise comparison of samples that did not cluster with representative genomes in Fig. 1a. The representative and publicly available genomes are shaded in grey. See colour gradient representing the percent identities in the heatmaps of (a) and (b) or the AF per genome in (c).

Fig. 2. Map of Australia showing the number of phytoplasma-positive samples collected per state or territory, with pie charts illustrating the proportions of ANI-identified phytoplasma taxa per state or territory (see key below for descriptions of colour-coding). The scale on the right indicates the number of samples collected for each state or territory, with the number in brackets indicating the total number of ANI-identified samples per location within the map area. Abbreviations: ACT, Australian Capital Territory; NSW, New South Wales; NT, Northern Territory; QLD, Queensland; TAS, Tasmania; VIC, Victoria; WA, Western Australia.

Fig. 3. Bar graphs indicating the relative abundances of (a) symptom types recorded for each plant host family analysed in this study (n=176 samples); (b) the ANI-identified phytoplasma taxa per plant or insect host family analysed in this study (n=178 samples); and (c) the symptom types recorded for each ANI-identified phytoplasma taxon analysed in this study (n=176 samples). Numbers in the bar graphs indicate the total number of samples. Colour legends are shown above each graph.

Table 1. Summary of all host species, genera, and families investigated in this study and associated metadata organized according to their sampling location in Australia (state/territory and closest town/city). Metadata recorded included the sample names, sampling years, and the phytoplasma taxa identified based on blastn or ANI analyses performed in this study. Original sample names, if provided, are recorded alongside the corresponding 'BAWM' name in Table S1.

Table 2. Summary of blastn top hits of the 16S rRNA gene sequences obtained for the samples investigated in this study, including the putative phytoplasma taxon identified, the number of samples with this result, and the range of percent nucleotide identities shared with the top hit.

Table 3. Summary of samples for which initial host identifications were unresolved to the family, genus, or species level based on visual inspections and for which additional gene regions obtained from metagenomic data were used to determine the host identity. The genes used for plant host identifications included the maturase K (matK) and ribulose-bisphosphate carboxylase (rbcL) genes, and the cytochrome c oxidase subunit 1 (coI) gene was used for insect identification. The e-value, percent identity, and bitscores of the top blastn hit(s) for the sample are provided to illustrate the support for the gene-based host identification. na, not applicable.

Table 4. List of phytoplasma hosts investigated in this study, characterized based on whether the host was recorded previously as a phytoplasma host/putative vector or not, and whether these hosts were classified as crop (C), insect (I), native plant (NP), ornamental (O), or weed (W) in this study. The phytoplasma taxa that were identified are listed for the respective host.